This vignette applies the UINMF algorithm
(Kriebel et al. 2021)
implemented in the LIGER package to integrate the single-cell RNA
sequencing (scRNA-Seq) and the mass spectrometry-based single-cell
proteomics (MS-SCP) data. This data set is provided by the
SingleCellMultiModal package. The major advantage of using the
recently published UINMF algorithm is that it does not require that
the different modalities are acquired from the same cell and it can
include features that are specific to each modality.
We use the SCoPE2 function from the SingleCellMultiModal package
to retrieve the MS-SCP and scRNA-Seq data set. SCoPE2 refers to the
name of the acquisition protocol for MS-SCP data.
library(SingleCellMultiModal)
mae <- SCoPE2("macrophage_differentiation",
modes = "rna|protein",
dry.run = FALSE)
## Warning in split.default(filepaths, fact): data length is not a multiple of
## split variable
mae
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] protein: SingleCellExperiment with 3042 rows and 1490 columns
## [2] rna: SingleCellExperiment with 32738 rows and 20274 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save all data to files
The mae object contains two assays containing either the MS-SCP data
(protein) or the scRNA-Seq data (rna). Refer to the SCoPE2 vignette
(vignette("SCoPE2", package = "SingleCellMultiModal")) for more
information about the data set.
We next need to prepare the data before integration. While the protein data has already been extensively processed, the RNA data has undergone only minimal processing. RNA data has already been filtered for cells that pass QC, but still need to be processed in order to select for highly variable features.
First, we remove features that were not detected. A gene is detected if it has at least one count in at least one of the cells.
sel <- rowSums(counts(mae[["rna"]]) > 0) > 0
table(sel)
## sel
## FALSE TRUE
## 10149 22589
About 30 % of the genes (10149 out of 32738) in the data were not detected. We remove them.
mae[["rna"]] <- mae[["rna"]][sel, ]
We used the scuttle package to normalization the scRNA-Seq data. This
is performed using the logNormCounts. Adding transform = "none"
will create a new data matrix containing the normalized counts before
log-transformation.
library(scuttle)
## Loading required package: SingleCellExperiment
mae[["rna"]] <- logNormCounts(mae[["rna"]], transform = "none")
mae[["rna"]] <- logNormCounts(mae[["rna"]])
Notice that two new data assay were added to the RNA assay, called
normcounts and logcounts.
assays(mae[["rna"]])
## List of length 3
## names(3): counts normcounts logcounts
For the scRNA-Seq data, we select highly variable genes (HVG).
Those can be identified by modelling the gene-variance relationship
from the logcounts data. This is performed using the modelGeneVar
from the scran package.
library(scran)
varModel <- modelGeneVar(mae[["rna"]])
This model decompose the gene variance into technical variability (fitted trend) and biological variability (residuals).
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5 ✔ purrr 0.3.4
## ✔ tibble 3.1.4 ✔ dplyr 1.0.7
## ✔ tidyr 1.1.3 ✔ stringr 1.4.0
## ✔ readr 2.0.1 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::collapse() masks IRanges::collapse()
## ✖ dplyr::combine() masks Biobase::combine(), BiocGenerics::combine()
## ✖ dplyr::count() masks matrixStats::count()
## ✖ dplyr::desc() masks IRanges::desc()
## ✖ tidyr::expand() masks S4Vectors::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first() masks S4Vectors::first()
## ✖ dplyr::lag() masks stats::lag()
## ✖ ggplot2::Position() masks BiocGenerics::Position(), base::Position()
## ✖ purrr::reduce() masks GenomicRanges::reduce(), IRanges::reduce()
## ✖ dplyr::rename() masks S4Vectors::rename()
## ✖ dplyr::slice() masks IRanges::slice()
ggplot(data.frame(varModel)) +
aes(x = mean) +
geom_point(aes(y = total)) +
geom_line(aes(y = tech), colour = "red") +
ylab("variance")
We filter the 2000 genes that have highest biological variability, constraint to features with a false discover rate smaller than 0.3 (the hypotheses being the biological variability is greater than the technical variability against there are no difference between the two types of variability). We store the selection results in the data set.
topHvg <- getTopHVGs(varModel,
n = 2000,
var.field = "bio",
fdr.field = "FDR",
fdr.threshold = 0.3)
rowData(mae[["rna"]])$top2000HVG <- rownames(mae[["rna"]]) %in% topHvg
For the MS-SCP data, we cannot model the mean-variance relationship since the data already went through protein-wise normalization. Therefore the distribution of the average protein expression is centered around zero. We will only focus on the protein variance
vars <- rowVars(assay(mae[["protein"]]))
hist(vars)
Although this approach is far from being perfect and highlight the current lack of principled MS-SCP data processing workflows, we select the top 2000 proteins that have highest variance.
thres <- sort(vars, decreasing = TRUE)[2000]
rowData(mae[["protein"]])$top2000HVP <- vars >= thres
Before proceeding to the integration, we need to convert the features names (genes and proteins) to a common naming. We decided to convert to HGNC symbols because this type of annotation is easier to interpret from downstream analysis results.
The conversion is performed using biomaRt and relying on the Ensembl
database. We use the human data base since cells in the data were
isolated from human cell line cultures.
library(biomaRt)
human <- useMart("ensembl", "hsapiens_gene_ensembl")
The features names in the rna assay are already encoded as HGNC
symbols and so do not require additional processing.
head(rownames(mae[["rna"]]))
## [1] "RP11-34P13.7" "RP11-34P13.8" "AL627309.1" "RP11-34P13.9"
## [5] "AP006222.2" "RP4-669L17.10"
The feature names in the protein assay are however encoded as Uniprot
protein IDs that does not match the symbols in the rna assay.
head(rownames(mae[["protein"]]))
## [1] "A0A075B6H9" "A0A0B4J1V0" "A0A0B4J237" "A0A1B0GTH6" "A0A1B0GUA6"
## [6] "A0A1B0GUU1"
We therefore convert those proteins IDs to HGNC symbols. Note that
HGNC symbols are not available for all protein IDs, resulting in some
gene symbols to be NA. Proteins with unassigned HGNC symbols are
removed from the analysis.
protIds <- rownames(mae[["protein"]])
conversion <- getBM(attributes = c("uniprot_gn_id", "hgnc_symbol"),
filters = "uniprot_gn_id",
values = protIds,
mart = human,
uniqueRows = TRUE)
rownames(mae[["protein"]]) <-
conversion$hgnc_symbol[match(protIds, conversion$uniprot_gn_id)]
mae[["protein"]] <- mae[["protein"]][!is.na(rownames(mae[["protein"]])), ]
Now that the feature names for both modalities are encoded using the same convention, we can check the overlap between the two assays.
all <- union(rownames(mae[["rna"]]),
rownames(mae[["protein"]]))
table(`in rna` = all %in% rownames(mae[["rna"]]),
`in protein` = all %in% rownames(mae[["protein"]]))
## in protein
## in rna FALSE TRUE
## FALSE 0 384
## TRUE 20035 2520
Export data to csv files for processing/exploring with third-paty tools.
(The code chunks below aren’t evaluated)
write.csv(colData(mae[["rna"]]), file = "../data/rna_sample_annotation.csv")
write.csv(colData(mae[["protein"]]), file = "../data/protein_sample_annotation.csv")
write.csv(assay(mae[["rna"]]), file = "../data/rna_expression.csv")
write.csv(assay(mae[["protein"]]), file = "../data/protein_expression.csv")
The integration performed in this vignette uses the UINMF algorithm described in Kriebel et al. 2021. The algorithm decomposes the data (\([E_1, P_1]\) and [E_2, P_2]) in different latent components. The underlying model is illustrated in this figure taken from the original article.
The UINMF algorithm is available from the liger package. Currently, this
available on the U_algorithm branch of the GitHub repository.
## BiocManager::install("welch-lab/liger", ref = "U_algorithm")
library(liger)
The LIGER package implements its own data structure. We therefore
need to create a liger object from the mae object. Since the data
are easily accessible, this conversion is straightforward to perform.
Note, that we don’t use the log-transformed data but rather the data
on the count scale. This is because we here assume that the biological
processes can be modelled by additive effects. This assumption would
not hold when integrating the data in logarithmic scale. We therefore
extract the normalized counts for the scRNA-Seq data and exponentiate
the MS-SCP data.
lig <- createLiger(list(rna = normcounts(mae[["rna"]]),
protein = 2^assay(mae[["protein"]])),
make.sparse = TRUE,
take.gene.union = FALSE,
remove.missing = TRUE)
Normalize data by dividing by the summed expression per cell.
lig <- liger::normalize(lig)
This section will perform the integration so to speak. We first need to scale each columns (cell) to unit variance. This is to provide the same weights to all cells during the integration.
lig <- scaleNotCenter(object = lig,
remove.missing = TRUE)
Next, we fit the UINMF model to the data using the optimizeALS
function.
lig <- optimizeALS(lig,
k = 30, ## Number of latent factors
lambda = 5,
thresh = 1e-06,
max.iters = 30,
use.unshared = TRUE)
## [1] "Performing Factorization using UINMF and unshared features"
## [1] "Processing"
##
|
| | 0%
|
|===== | 7%
|
|======= | 10%
|
|========= | 13%
|
|============ | 17%
|
|============== | 20%
|
|================ | 23%
|
|=================== | 27%
|
|===================== | 30%
|
|======================= | 33%
|
|========================== | 37%
|
|============================ | 40%
|
|============================== | 43%
|
|================================= | 47%
|
|=================================== | 50%
|
|===================================== | 53%
|
|======================================== | 57%
|
|========================================== | 60%
|
|============================================ | 63%
|
|=============================================== | 67%
|
|================================================= | 70%
|
|=================================================== | 73%
|
|====================================================== | 77%
|
|======================================================== | 80%
|
|========================================================== | 83%
|
|============================================================= | 87%
|
|=============================================================== | 90%
|
|================================================================= | 93%
|
|==================================================================== | 97%
|
|======================================================================| 100%
## Current seed 1 current objective 40782154
## Best results with seed 1.
Finally, we combine the \(H_{rna}\) and \(H_{protein}\) matrices in a
single \(H\) matrix using quantile_norm function. This allows to
integrate the protein and transcriptomic latent spaces in a common
latent space.
lig <- quantile_norm(lig, ref_dataset = "rna")
metadata(mae)$Hnorm <- lig@H.norm
metadata(mae)$W <- lig@W
metadata(mae)$V <- lig@V
We now assess the integration of the proteomic to the transcriptomic
data. Each cell is characterized by 30 latent factors. For easier
visualization, we reduce the factors to two dimensions by applying
UMAP to the integrated \(H\) matrix. This is done internally by the
runUMAP function from the liger package.
library(scater)
##
## Attaching package: 'scater'
## The following objects are masked from 'package:liger':
##
## runTSNE, runUMAP
umap <- calculateUMAP(metadata(mae)$Hnorm,
ncomponents = 2,
scale = TRUE,
n_neighbors = 10,
transpose = TRUE)
rownames(umap) <- rownames(metadata(mae)$Hnorm)
metadata(mae)$UMAP <- umap
Then, using some data manipulation to extract the dimension reduction and add the modality, we can visualize the data integration.
df <- data.frame(UMAP = metadata(mae)$UMAP)
df <- cbind(df, colData(mae)[rownames(df), ])
df$Modality <- ifelse(startsWith(rownames(df), "i"), "protein", "rna")
gg <- ggplot(df) +
aes(x = UMAP.1,
y = UMAP.2,
colour = Modality) +
geom_point(size = 0.5, alpha = 0.3) +
theme_minimal() +
theme(legend.position = "bottom") +
xlab("UMAP 1") +
ylab("UMAP 2")
gg + scale_colour_manual(values = c("#7332a8", "orange2"),
na.value = "grey")
From this graph, we can see that protein and RNA data do overlap indicating that the cells that were analyzed do indeed come from the same experimental design.
gg + aes(colour = celltype)
library(scran)
library(igraph)
g <- buildSNNGraph(t(metadata(mae)$Hnorm),
k = 15,
type = "rank")
clust <- cluster_louvain(g)$membership
names(clust) <- rownames(metadata(mae)$Hnorm)
colData(mae)[names(clust), "cluster"] <- paste0("cl", clust)
df <- data.frame(UMAP = metadata(mae)$UMAP)
df <- cbind(df, colData(mae)[rownames(df), ])
df$Modality <- ifelse(startsWith(rownames(df), "i"), "protein", "rna")
ggplot(df) +
aes(x = UMAP.1,
y = UMAP.2,
colour = cluster) +
geom_point(size = 0.5, alpha = 0.3) +
geom_label(data = . %>% group_by(cluster) %>%
summarize(UMAP.1 = median(UMAP.1),
UMAP.2 = median(UMAP.2)),
aes(label = cluster), size = 2) +
theme_minimal() +
xlab("UMAP 1") +
ylab("UMAP 2")
expRatio <- sum(df$Modality == "rna") / nrow(df)
group_by(df, cluster, Modality) %>%
summarise(n = n()) %>%
group_by(cluster) %>%
mutate(p = n / sum(n)) %>%
ggplot() +
aes(x = cluster,
y = p,
fill = Modality) +
geom_bar(stat = "identity") +
geom_hline(yintercept = expRatio) +
ylab("Proportion") +
theme_minimal()
## `summarise()` has grouped output by 'cluster'. You can override using the `.groups` argument.
randRatio <-
sum(!is.na(df$celltype) & df$celltype == "Monocyte") / sum(!is.na(df$celltype))
filter(df, Modality == "protein") %>%
group_by(cluster, celltype) %>%
summarise(n = n()) %>%
group_by(cluster) %>%
mutate(p = n / sum(n)) %>%
ggplot() +
aes(x = cluster,
y = p,
fill = celltype) +
geom_bar(stat = "identity") +
geom_hline(yintercept = randRatio) +
ylab("Proportion") +
theme_minimal()
## `summarise()` has grouped output by 'cluster'. You can override using the `.groups` argument.
filter(df, Modality == "protein") %>%
group_by(cluster, batch_MS) %>%
summarise(n = n()) %>%
rbind(data.frame(cluster = "expected",
batch_MS = unique(df$batch_MS),
n = 1)) %>%
group_by(cluster) %>%
mutate(p = n / sum(n)) %>%
ggplot() +
aes(x = cluster,
y = p,
fill = batch_MS) +
geom_bar(stat = "identity") +
ylab("Proportion") +
theme_minimal() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
## `summarise()` has grouped output by 'cluster'. You can override using the `.groups` argument.
rna <- getWithColData(mae, "rna")
## Warning: 'experiments' dropped; see 'metadata'
## harmonizing input:
## removing 1490 sampleMap rows not in names(experiments)
## removing 1490 colData rownames not in sampleMap 'primary'
res <- findMarkers(rna, groups = rna$cluster,
pval.type = "some",
min.prop = 0.7,
assay.type = "logcounts")
markers <- c("OASL", "DDX58", "ISG15", "IFIT3", "OAS1")
plots <- lapply(markers, function(marker) {
data.frame(exprs = logcounts(rna)[marker, ],
UMAP = metadata(mae)$UMAP[colnames(rna), ],
colData(rna)[colnames(rna), ]) %>%
ggplot +
aes(x = UMAP.1,
y = UMAP.2,
colour = exprs) +
geom_point(size = 0.3, alpha = 0.3) +
ggtitle(marker) +
scale_color_continuous(type = "viridis") +
theme_minimal() +
xlab("UMAP 1") +
ylab("UMAP 2")
})
wrap_plots(plots)
markers <- c("PLAUR", "ITGB1", "CD82", "TGFB1", "C5AR1", "PDIA3", "HSPA5", "CD44", "MMP1")
plots <- lapply(markers, function(marker) {
data.frame(exprs = logcounts(rna)[marker, ],
UMAP = metadata(mae)$UMAP[colnames(rna), ],
colData(rna)[colnames(rna), ]) %>%
ggplot +
aes(x = UMAP.1,
y = UMAP.2,
colour = exprs) +
geom_point(size = 0.3, alpha = 0.3) +
ggtitle(marker) +
scale_color_continuous(type = "viridis") +
theme_minimal() +
xlab("UMAP 1") +
ylab("UMAP 2")
})
wrap_plots(plots)
markers <- c("MALAT1", "NEAT1", "MYO1F", "N4BP2L2")
plots <- lapply(markers, function(marker) {
data.frame(exprs = logcounts(rna)[marker, ],
UMAP = metadata(mae)$UMAP[colnames(rna), ],
colData(rna)[colnames(rna), ]) %>%
ggplot +
aes(x = UMAP.1,
y = UMAP.2,
colour = exprs) +
geom_point(size = 0.3, alpha = 0.3) +
ggtitle(marker) +
scale_color_continuous(type = "viridis") +
theme_minimal() +
xlab("UMAP 1") +
ylab("UMAP 2")
})
wrap_plots(plots)
markers <- c("S100A8", "S100A9", "ALOX5AP", "TREM2", "TYROBP", "FTH1", "LSP1")
plots <- lapply(markers, function(marker) {
data.frame(exprs = logcounts(rna)[marker, ],
UMAP = metadata(mae)$UMAP[colnames(rna), ],
colData(rna)[colnames(rna), ]) %>%
ggplot +
aes(x = UMAP.1,
y = UMAP.2,
colour = exprs) +
geom_point(size = 0.3, alpha = 0.3) +
ggtitle(marker) +
scale_color_continuous(type = "viridis") +
theme_minimal() +
xlab("UMAP 1") +
ylab("UMAP 2")
})
wrap_plots(plots)
markers <- c("SPP1", "PLAUR", "TIMP1", "MMP1", "CCL2", "CCL3", "CCL7", "FCER1G", "CTSB", "TDO2", "ELANE", "IL32", "AZU1", "RPSA", "DEK", "RBMX")
plots <- lapply(markers, function(marker) {
data.frame(exprs = logcounts(rna)[marker, ],
UMAP = metadata(mae)$UMAP[colnames(rna), ],
colData(rna)[colnames(rna), ]) %>%
ggplot +
aes(x = UMAP.1,
y = UMAP.2,
colour = exprs) +
geom_point(size = 0.3, alpha = 0.3) +
ggtitle(marker) +
scale_color_continuous(type = "viridis") +
theme_minimal() +
xlab("UMAP 1") +
ylab("UMAP 2")
})
wrap_plots(plots)